Abstract

Multimodal learning with deep Boltzmann machines (DBMs)
is a generative approach to fusing multimodal inputs, and
can learn a shared representation via Contrastive
Divergence (CD) for classification and information
retrieval tasks. However, it is a 2-fan DBM model and
cannot effectively handle multiple prediction tasks.
Moreover, this model cannot recover the hidden
representations well by sampling from the conditional
distribution when more than one modality is missing. In
this paper, we propose a K-fan deep structure model, which
can handle multi-input and multi-output learning problems
effectively. In particular, the deep structure has K
branches for different inputs, where each branch can be
composed of a multi-layer deep model, and a shared
representation is learned in a discriminative manner to
tackle multimodal tasks. Given the deep structure, we
propose two objective functions to handle two multi-input
and multi-output tasks: joint visual restoration and
labeling, and multi-view multi-class object recognition.
To estimate the model parameters, we initialize the deep
model parameters with CD to maximize the joint
distribution, and then use backpropagation to update the
model according to the specific objective function. The
experimental results demonstrate that the model can
effectively leverage multi-source information and predict
multiple tasks well compared with competitive baselines.


1 Introduction 

We explore multimodal learning in a joint framework for
settings where multiple forms of data, such as images,
labels, texts and videos, are available in the information
age. Each modality is characterized by very distinct
statistical properties, but it also reflects one or two
facets of the same data even though the modalities come
from different input channels. Thus, it is possible to
leverage different inputs to learn a shared representation
for prediction tasks such as data restoration and
classification. Recent advances in deep learning [1] and
multi-modality learning [2] shed light on joint
representation learning, which captures the real-world
concept that the data corresponds to. Deep learning
methods [1, 3], such as deep belief networks (DBNs) [4],
CNNs [6, 7] and LSTMs [8], can learn abstract and
expressive representations that capture a huge number of
possible input configurations. Hence, the learned
representation is useful for classification and
information retrieval. The multimodal learning model [5]
in a sense extends a deep learning framework, such as the
deep autoencoder or deep Boltzmann machines (DBMs), to
handle different modalities. Thus it can learn a joint
representation such that similarity in the code space
implies similarity of the corresponding concepts. However,
these previous multi-modality models [2, 5] are 2-fan deep
models and can only handle or predict one task. Often, the
joint representation learned is not robust enough when the
data is very noisy or partly missing. Furthermore, how to
leverage multi-source information from multiple modalities
is also an interesting topic for classification and
information retrieval.

In this paper, we propose a K-fan deep structure model,
which generalizes the previous 2-fan multimodal learning
models [2, 5] to handle multiple inputs and outputs. Our
model is composed of K branches of deep models with a
shared hidden layer. In particular, the deep structure
model has K pathways, one for each input, and learns a
shared representation to tackle multimodal tasks in a
discriminative manner. Our model is powerful because each
branch can be a multi-layer deep model; for example, we
can use a DBN, CNN or LSTM in each branch to handle
different modalities, such as images, texts and videos.
Refer to the right panel in Fig. 1, with DBNs used in each
branch, for a 3-fan deep structure case. Most similar to
our work are the bi-modal deep models [2, 5] that handle
image-text data or speech-vision data. The multimodal
model proposed by Ngiam et al. [2] used a deep autoencoder
for speech and vision fusion, while the other method [5]
leveraged DBMs to learn hidden representations for
bi-modal image-text data. There are, however, several
crucial differences between our model and these methods.
First, our model can handle K different modalities instead
of only bi-modal inputs. Moreover, each branch can be a
deep model, such as a DBN, CNN or LSTM, to handle a
different modality. Thus, our model is more powerful than
2-fan modality models. Secondly, our deep structure can
jointly learn from multiple inputs to predict multiple
outputs with shared representations. Lastly, given the
deep K-fan structure, it is very flexible to design an
objective function and learn the model parameters in a
discriminative manner by fusing multiple inputs.

Figure 1: The left is an example of a 2-fan multimodal
DBM, with shared representations via the 3-layer deep
architecture. The middle is a 3-fan DBM, with a joint
representation shared by $v^x$, $v^y$ and $v^z$. The right
is a 3-fan DBN, which is different from the DBM and can be
easily generalized into a K-fan deep structure by
extending more branches from the shared representations.
Note that, except for the shared hidden nodes, each branch
can be a different deep model, such as a DBN, CNN or LSTM,
with a different number of layers and a different number
of nodes in each layer.

In this paper, we use a composition of multiple DBNs as a
special case of our K-fan model. Note that other deep
models such as CNNs and LSTMs can be used in each branch
to handle more complex data. In the learning stage, all
inputs are treated as observed data. Thus, we pretrain the
model by leveraging all channels to learn the joint
representations via Contrastive Divergence (CD). Then, we
fine-tune the model parameters via backpropagation
according to the task. In the inference stage, our model
uses a feed-forward pass to handle multiple inputs and
output predictions.

We test our model on two different tasks: joint visual
restoration and labeling, and multi-class object
recognition. The first experiment is a multi-task learning
problem, which needs to jointly restore the data and label
it. In the second experiment, we leverage multiple inputs
or resources to improve the prediction accuracy. The
experimental results demonstrate the advantages of our
model over other competitive baselines, such as the
multimodal DBM and support vector machines (SVMs).

2 Related work 

Over the past few years, several approaches have been
proposed for learning from multimodal data. For example, a
joint model of images and text using dual-wing harmoniums
was built by Xing et al. [9]; it is a generative model and
can be viewed as a linear RBM model with Gaussian and
Poisson visible units. Huiskes et al. [10] used standard
low-level image features with additional captions, or
tags, to improve classification accuracy significantly
over an SVM. A similar approach [11], based on the
multiple kernel learning framework, was also proposed and
demonstrated that an additional text modality can improve
the accuracy of SVMs on various object recognition tasks.

Recently, multimodal learning with deep models has
attracted great attention in the machine learning
community. The approach of Ngiam et al. [2] used a deep
autoencoder for speech and vision fusion. A multimodal DBM
[5] has also been proposed, which can be viewed as a
composition of unimodal undirected pathways. Each pathway
can be pretrained separately in a completely unsupervised
fashion, with a large supply of unlabeled data. Any number
of pathways, each with any number of layers, could
potentially be used. One advantage of the multimodal DBM
is that it is a generative model, which allows it to
naturally handle missing data in one channel by sampling.
However, it cannot effectively handle data missing in
multiple channels. More specifically, it cannot be used to
predict multiple tasks.

In this paper, we propose a unified K-fan deep neural
network, which can learn a shared representation to handle
different tasks. Moreover, each branch can be composed of
a multi-layer deep model, such as a CNN, DBN or LSTM, to
handle different inputs. Thus, our model generalizes
previous multimodal models [2, 5] and is powerful enough
for more complex tasks. We initialize the model parameters
in a generative manner with the CD algorithm. In the
fine-tuning stage, we update the model parameters by
optimizing the given objective functions. Thus, our model
can leverage multiple resources and also handle or predict
multiple outputs. We test our model on two different
tasks: joint visual restoration and recognition, and
multi-view multi-class object recognition.

As for the former task, visual restoration and recognition
are usually addressed in separate pipelines, for example
image denoising followed by recognition. One related work
is the denoising autoencoder [12], which extends the work
[?] and minimizes the reconstruction loss between the
output and the clean version of the corrupted input in
order to learn a feature representation. Recently, a
robust Boltzmann machine (RoBM) [14] was introduced for
recognition and denoising. This model added another shape
RBM to the Gaussian RBM prior to model the noise
variables, which indicate where to ignore the occluder in
the image. Thus, the RoBM contains a multiplicative gating
mechanism to handle unexpected corruptions of the observed
variables. However, the experiments only show its
effectiveness for regular structural noise.

As for multi-view multi-class object recognition, Torralba
et al. [13] proposed the joint boosting method to learn
shared representations for multi-view multi-class object
detection. Huiskes et al. [10] used SVMs to leverage the
additional view information to boost the classification
performance.

We compare our method to other competitive baselines such
as the multimodal DBM and SVMs, and the experimental
results show the advantages of our model over a variety of
tasks.

3 K-fan multimodal deep model 

Our K-fan deep structure is composed of K deep models, one
in each branch, with a shared representation to handle
multiple modalities. For clarity, we use DBNs as the deep
model in each branch of our K-fan deep structure to
explain our model. More specifically, we use restricted
Boltzmann machines (RBMs) as the building blocks in each
branch. In the pretraining stage, we initialize the
parameters with the CD algorithm, as in deep Boltzmann
machines. In the fine-tuning stage, our model is more like
a generalized deep autoencoder with multiple pathways. In
the following parts, we first review RBMs and then
introduce our model.

3.1 Background 

An RBM with $n$ hidden units is a parametric model of the
joint distribution between a layer of hidden variables
$h \in \{0,1\}^n$ and the observation $v \in \{0,1\}^D$.
The RBM joint likelihood takes the form:

$$ p(v, h) = \frac{1}{Z} \exp\big(-E(v, h)\big) \tag{1} $$

where the energy function is

$$ E(v, h) = -h^\top W v - b^\top v - c^\top h \tag{2} $$

And we can compute the following conditional likelihoods:

$$ p(v \mid h) = \prod_i p(v_i \mid h) \tag{3a} $$

$$ p(v_i = 1 \mid h) = \mathrm{logistic}\Big(b_i + \sum_j W_{ji} h_j\Big) \tag{3b} $$

$$ p(h_j = 1 \mid v) = \mathrm{logistic}\Big(c_j + \sum_i W_{ji} v_i\Big) \tag{3c} $$

where $\mathrm{logistic}(x) = 1/(1 + e^{-x})$. To learn the
RBM parameters, we need to minimize the negative
log-likelihood $-\log p(v)$ on the training data. The
parameter updates can be calculated with an efficient
stochastic descent method, namely contrastive divergence
(CD) [?]. Thus, we get the following stochastic gradient
for $W$ from CD:

$$ \frac{\partial \log p(v)}{\partial W_{ji}} = \langle h_j v_i \rangle_{\mathrm{data}} - \langle h_j v_i \rangle_{\mathrm{model}} \tag{4} $$

where $\langle \cdot \rangle_{\mathrm{data}}$ and
$\langle \cdot \rangle_{\mathrm{model}}$ denote expectations
under the data and model distributions (CD approximates
the latter with a short Gibbs chain), and we ignore the
biases of both hidden and observation layers. We then
update $\theta$ until convergence with gradient steps

$$ \theta = \theta + \eta \frac{\partial \log p(v)}{\partial \theta} \tag{5} $$

where $\theta$ denotes the weights and biases, and $\eta$
is the learning rate.

A deep (restricted) Boltzmann machine (DBM) is a stack of
RBMs, in which each layer can capture complicated,
higher-order correlations between the activities of hidden
features in the layer below [?]. A two-layer DBM contains
one layer of visible units $v \in \{0,1\}^D$ and two layers
of hidden variables $h^{(1)} \in \{0,1\}^{n_1}$ and
$h^{(2)} \in \{0,1\}^{n_2}$. The energy of the joint
configuration $\{v, h^{(1)}, h^{(2)}\}$ is defined as
(ignoring bias terms):

$$ E(v, h; \theta) = -v^\top W^{(1)} h^{(1)} - h^{(1)\top} W^{(2)} h^{(2)} \tag{6} $$

where $h = \{h^{(1)}, h^{(2)}\}$ represents the set of
hidden units, and $\theta = \{W^{(1)}, W^{(2)}\}$ are the
model parameters, representing visible-to-hidden and
hidden-to-hidden symmetric interaction terms. Similar to
RBMs, this binary-binary DBM can be easily extended to
model dense real-valued or sparse count data, which has
been extensively discussed in [?]. Then, we can get the
following likelihood by marginalizing out $h$:

$$ P(v; \theta) = \sum_{h^{(1)}, h^{(2)}} P(v, h^{(1)}, h^{(2)}; \theta) = \frac{1}{Z(\theta)} \sum_{h^{(1)}, h^{(2)}} \exp\big(-E(v, h; \theta)\big) \tag{7} $$

where $Z(\theta)$ is the partition function. The
parameters in Eq. 7 can be initialized with a greedy
layer-wise pretraining step [?]. Basically, the parameters
of the current RBM are learned with CD, and the learned
features of the current layer RBM are treated as the
“data” to train the next RBM in the stack. The joint
representation learning of multiple inputs, as well as
model parameter updating via mean-field, will be
introduced in the next part.
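
As a minimal sketch of the CD update and the greedy layer-wise stacking described above, the following NumPy code may help. It assumes CD with a single Gibbs step, a bias-free binary RBM, and probabilities (rather than samples) for the reconstruction; the function names `cd1_update` and `pretrain_stack`, the initialization scale and the number of epochs are illustrative choices, not taken from the paper.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, v0, rng, lr=0.1):
    """One CD-1 step for a bias-free binary RBM.

    W has shape (n_hidden, n_visible), matching E(v, h) = -h^T W v.
    v0 is a batch of visible vectors with shape (batch, n_visible).
    """
    # Positive phase: p(h = 1 | v0) and a binary sample of h.
    ph0 = logistic(v0 @ W.T)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step down to the visibles and up again.
    pv1 = logistic(h0 @ W)
    ph1 = logistic(pv1 @ W.T)
    # <h v>_data - <h v>_reconstruction, averaged over the batch.
    grad = (ph0.T @ v0 - ph1.T @ pv1) / v0.shape[0]
    return W + lr * grad

def pretrain_stack(data, layer_sizes, epochs=5, seed=0):
    """Greedy pretraining: each RBM's hidden activations feed the next RBM."""
    rng = np.random.default_rng(seed)
    weights, x = [], data
    for n_hidden in layer_sizes:
        W = 0.01 * rng.standard_normal((n_hidden, x.shape[1]))
        for _ in range(epochs):
            W = cd1_update(W, x, rng)
        weights.append(W)
        x = logistic(x @ W.T)  # features treated as "data" for the next RBM
    return weights
```

Calling `pretrain_stack(data, [400, 200])` on a binary data matrix would return one weight matrix per layer; each branch of the K-fan structure could be initialized this way before the shared layer is added.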

3.2 The description of joint representations 

Our K-fan deep model is a deep neural network with K
different kinds of inputs coupled with stochastic binary
hidden units in a hierarchical structure. The inputs can
be binary or real-valued, and they share a hidden layer
via a multi-layer non-linear transform of RBMs for each
input. For clarity, we will use the 3-way deep structure
shown in Fig. 1 to explain our model.
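
To make the structure concrete, here is a rough sketch of how K branches could be fused into one shared hidden layer in a feed-forward pass. The fusion rule (sum the top-layer projections of all branches, then apply a sigmoid) follows the shared-layer equation the paper gives later for two inputs; the input sizes, layer widths and function names are hypothetical, and biases are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def branch_features(v, weights):
    """Push one input through its own stack of sigmoid layers."""
    h = v
    for W in weights:
        h = sigmoid(h @ W)
    return h

def shared_representation(inputs, branch_weights, top_weights):
    """Fuse K branches: sum each branch's top-layer projection, then squash."""
    total = 0.0
    for v, weights, W_top in zip(inputs, branch_weights, top_weights):
        total = total + branch_features(v, weights) @ W_top
    return sigmoid(total)

rng = np.random.default_rng(0)
dims = [30, 6, 10]            # hypothetical input sizes for K = 3 branches
hidden, shared = [16, 8], 5   # hypothetical layer widths
inputs = [rng.random((4, d)) for d in dims]
branch_weights = [
    [0.1 * rng.standard_normal((a, b)) for a, b in zip([d] + hidden[:-1], hidden)]
    for d in dims
]
top_weights = [0.1 * rng.standard_normal((hidden[-1], shared)) for _ in dims]
h_shared = shared_representation(inputs, branch_weights, top_weights)
```

Adding a fourth modality only requires appending one more entry to `inputs`, `branch_weights` and `top_weights`, which is the sense in which the structure extends from 3-fan to K-fan.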

Suppose we have a set of visible inputs
$v^x \in \{0,1\}^{D_x}$, $v^y \in \{0,1\}^{D_y}$ and
$v^z \in \{0,1\}^{D_z}$, and a sequence of layers of hidden
units for each input. Note that $v^x$, $v^y$ and $v^z$ can
have different dimensionalities. For clarity, we start by
modeling each input using a separate two-layer DBM. For
input $v^x$ and its two layers of hidden units
$h^{(x1)} \in \{0,1\}^{n_{x1}}$ and
$h^{(x2)} \in \{0,1\}^{n_{x2}}$, the probability of the
visible vector $v^x$ is given by

$$ P(v^x; \theta_x) = \sum_{h^{(x1)}, h^{(x2)}} P(v^x, h^{(x1)}, h^{(x2)}; \theta_x) \tag{8} $$

$$ = \frac{1}{Z(\theta_x)} \sum_{h^{(x1)}, h^{(x2)}} \exp\Big( \sum_{k,j} v^x_k W^{(x1)}_{kj} h^{(x1)}_j + \sum_{j,t} h^{(x1)}_j W^{(x2)}_{jt} h^{(x2)}_t \Big) \tag{9} $$

where $\theta_x = \{W^{(x1)}, W^{(x2)}\}$. Note that we only
consider binary observations (the model can be easily
extended to the real-valued case with Gaussian RBMs) and
ignore biases for both visible and hidden units for
clarity. Analogously, we can get the likelihoods for $v^y$
and $v^z$ with the same formulas, but with different
superscripts.

To form our 3-fan deep model, we combine the three models
for $v^x$, $v^y$ and $v^z$ by adding an additional layer of
binary hidden units on top of them. The resulting
graphical model is shown in the right panel of Fig. 1. The
joint distribution over the multi-modal inputs can be
written as:

$$
\begin{aligned}
P(v^x, v^y, v^z; \theta) &= \sum_{h^{(x2)}, h^{(y2)}, h^{(z2)}, h^{(3)}} P(h^{(x2)}, h^{(y2)}, h^{(z2)}, h^{(3)}) \Big( \sum_{h^{(x1)}} P(v^x, h^{(x1)}, h^{(x2)}) \Big) \Big( \sum_{h^{(y1)}} P(v^y, h^{(y1)}, h^{(y2)}) \Big) \Big( \sum_{h^{(z1)}} P(v^z, h^{(z1)}, h^{(z2)}) \Big) \\
&= \frac{1}{Z(\theta)} \sum_{h} \exp\Big( \sum_{m \in \{x, y, z\}} \big[ v^{m\top} W^{(m1)} h^{(m1)} + h^{(m1)\top} W^{(m2)} h^{(m2)} + h^{(m2)\top} W^{(m3)} h^{(3)} \big] \Big)
\end{aligned}
\tag{10}
$$

where the sum over $h$ runs over all hidden layers, and
$W^{(x3)}$, $W^{(y3)}$ and $W^{(z3)}$ are respectively the
top-layer weights that connect each input's pathway to the
top shared layer $h^{(3)}$. The parameters are
$\theta = \{\theta_x, \theta_y, \theta_z, W^{(x3)}, W^{(y3)}, W^{(z3)}\}$,
where we ignore the biases.

3.3 Learning and Inference

In the learning stage, all inputs are available. Thus, we
can use CD to learn the shared representation and model
parameters effectively. In practice, we divide the
parameter estimation into two stages: parameter
initialization and fine-tuning. The parameter
initialization stage focuses on learning the joint
representations shared by different modalities, while the
fine-tuning stage emphasizes discriminative learning
according to the properties of the task at hand. More
specifically, because of the joint multimodal deep
structure, we first initialize the model parameters by
maximizing the joint likelihood in Eq. 10. Then we update
our model parameters by optimizing different objective
functions according to different tasks. Note that for
different objective functions, we have the same parameter
initialization step via CD because we use the same
multimodal deep structure.

3.3.1 Parameter initialization

We first use pretraining to initialize the weights of each
branch separately. Basically, this learns a stack of RBMs
greedily, layer by layer: the learned features of the
current layer RBM are treated as the “data” to train the
next RBM in the stack. Assume that we have a training set
$D = \{(v_i^x, v_i^y, v_i^z)\}_{i=1}^N$. Then, for each
branch, for example $v^x$, we can use RBMs to initialize
the weights $W^{(x1)}$, $W^{(x2)}$ (and $W^{(x3)}$) in the
layer-wise manner mentioned above.

Then, we update the model parameters with CD. Because
exact learning in this model is intractable, we use
efficient approximate learning and inference [?]: a
mean-field method to estimate the data-dependent
expectations, and an MCMC-based stochastic approximation
procedure to approximate the model's expected sufficient
statistics. In variational learning [17, 18, 15], the true
posterior distribution over latent variables
$p(h \mid v; \theta)$ for each training vector $v$ is
replaced by an approximate posterior $q(h \mid v; \mu)$,
and the parameters are updated by following the gradient
of a lower bound on the log-likelihood:

$$ \ln p(v; \theta) \ge \sum_{h} q(h \mid v; \mu) \ln p(v, h; \theta) + \mathcal{H}(q) \tag{11} $$

$$ = \ln p(v; \theta) - \mathrm{KL}\big[ q(h \mid v; \mu) \,\|\, p(h \mid v; \theta) \big] \tag{12} $$

where $h = \{h^{(x1)}, h^{(x2)}, h^{(y1)}, h^{(y2)}, h^{(z1)}, h^{(z2)}, h^{(3)}\}$,
$v = \{v^x, v^y, v^z\}$, $\mu$ denotes the mean-field
parameters of the hidden variables (see below), and
$\mathcal{H}(\cdot)$ is the entropy functional. Maximizing
this bound on the log-likelihood of the training data also
drives the parameters to minimize the Kullback-Leibler
divergence between the approximating and true posteriors.

Similar to [5], we use the naive mean-field approach, with
a fully factorized distribution to approximate the true
posterior:

$$ q(h \mid v; \mu) = \Big( \prod_i q(h_i^{(x1)} \mid v) \prod_j q(h_j^{(x2)} \mid v) \Big) \Big( \prod_i q(h_i^{(y1)} \mid v) \prod_j q(h_j^{(y2)} \mid v) \Big) \Big( \prod_i q(h_i^{(z1)} \mid v) \prod_j q(h_j^{(z2)} \mid v) \Big) \prod_k q(h_k^{(3)} \mid v) \tag{13} $$

where $\mu = \{\mu^{(x1)}, \mu^{(x2)}, \mu^{(y1)}, \mu^{(y2)}, \mu^{(z1)}, \mu^{(z2)}, \mu^{(3)}\}$
are the mean-field parameters with

$$ q(h_i^{(xl)} = 1) = \mu_i^{(xl)}, \quad \text{for } l = 1, 2 \tag{14} $$

$$ q(h_i^{(yl)} = 1) = \mu_i^{(yl)}, \quad \text{for } l = 1, 2 \tag{15} $$

$$ q(h_i^{(zl)} = 1) = \mu_i^{(zl)}, \quad \text{for } l = 1, 2 \tag{16} $$

$$ q(h_k^{(3)} = 1) = \mu_k^{(3)} \tag{17} $$

Then $\mu$ can be used to update the hidden variables in
the data-dependent term in Eq. 3, and the model-dependent
hidden variables in Eq. 1 can be sampled with MCMC. The
model parameters can then be updated with the CD algorithm
according to Eq. [?]
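
The naive mean-field step can be illustrated on the simpler two-layer, bias-free DBM of Eq. 6. The sketch below is an assumption-laden illustration rather than the authors' procedure: it iterates the standard fixed-point equations for the factorized posterior, and the name `mean_field`, the initialization at 0.5 and the iteration count are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(v, W1, W2, n_iters=20):
    """Naive mean-field for a bias-free two-layer DBM with energy
    E = -v^T W1 h1 - h1^T W2 h2.

    Returns (mu1, mu2), the approximate posteriors q(h1 = 1 | v)
    and q(h2 = 1 | v) for each row of v.
    """
    mu2 = np.full((v.shape[0], W2.shape[1]), 0.5)
    for _ in range(n_iters):
        # h1 receives input from the visibles below and from h2 above.
        mu1 = sigmoid(v @ W1 + mu2 @ W2.T)
        # h2 only receives input from h1.
        mu2 = sigmoid(mu1 @ W2)
    return mu1, mu2

rng = np.random.default_rng(0)
v = (rng.random((5, 6)) < 0.5).astype(float)
W1 = 0.1 * rng.standard_normal((6, 4))
W2 = 0.1 * rng.standard_normal((4, 3))
mu1, mu2 = mean_field(v, W1, W2)
# Data-dependent statistic for W1, as used in the positive phase.
positive_stat = v.T @ mu1 / v.shape[0]
```

In the K-fan model the same idea would apply per branch, with the shared top layer receiving input from every branch's second hidden layer.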


3.3.2 Parameter fine-tuning 

The parameter fine-tuning stage is different from previous
methods, and is determined by the task at hand. In our
experiments, we test our model on two kinds of tasks:
visual restoration and labeling, and multi-view
multi-class object recognition. For the two tasks, we use
the same deep structure with K = 3, but with different
objective functions.

Joint visual restoration and labeling: For a given
corrupted input $v^x$, we need to restore its clean image
$v^y$ as well as predict its label $v^z$. Thus, we propose
the following objective function for the binary case:

$$
\begin{aligned}
\hat{\theta} &= \arg\min_\theta \mathcal{L}(v^x, v^y, v^z; \theta) \\
&= \arg\min_\theta \; -\sum_{i=1}^{N} \big[ v_i^y \log \hat{v}_i^y + (1 - v_i^y) \log(1 - \hat{v}_i^y) \big] - \lambda \sum_{i=1}^{N} \big[ v_i^z \log \hat{v}_i^z + (1 - v_i^z) \log(1 - \hat{v}_i^z) \big]
\end{aligned}
\tag{18}
$$

where $\theta$ is the set of weights in the 3-way deep
architecture (we ignore the subscripts for clarity), and
$\lambda$ is the constant that balances the two losses in
Eq. 18. $\hat{v}_i^y$ and $\hat{v}_i^z$ are the predictions
from the noisy input $v_i^x$, specified as follows:

$$ h_i = \underbrace{f_L \circ f_{L-1} \circ \cdots \circ f_1}_{L \text{ times}} (v_i^x) \tag{19} $$

$$ \hat{v}_i^y = \underbrace{g_1 \circ g_2 \circ \cdots \circ g_L}_{L \text{ times}} (h_i) \tag{20} $$

$$ \hat{v}_i^z = \underbrace{\phi_1 \circ \phi_2 \circ \cdots \circ \phi_L}_{L \text{ times}} (h_i) \tag{21} $$

where $\circ$ indicates function composition, $h_i$ is the
shared hidden representation for the triplet
$(v^x, v^y, v^z)$, and the functions $f_l$, $g_l$ and
$\phi_l$ are non-linear projections, such as the sigmoid
function. We omit the parameters of the mapping functions
$f_l$, $g_l$ and $\phi_l$, for $l \in \{1, \ldots, L\}$, in
the above equations. In our case, we use the same number
of layers $L$ in all branches for simplicity and clarity.
Note that each branch in the deep structure network can
have a different number of layers and nodes, as long as
all branches keep the same dimensionality for the shared
representation.
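
A small sketch of how the joint objective could be evaluated: one encoder maps the corrupted input to the shared code, and two decoders produce the restored image and the label prediction. The weight lists `enc`, `dec_y` and `dec_z` are hypothetical and are given in the order the layers are applied (so `dec_y[0]` plays the role of $g_L$); biases are omitted and every layer is a sigmoid, which is an assumption made for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_entropy(target, pred, eps=1e-12):
    """Binary cross-entropy summed over samples and vector components."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return -np.sum(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))

def forward(h, weights):
    """Apply a stack of bias-free sigmoid layers in list order."""
    out = h
    for W in weights:
        out = sigmoid(out @ W)
    return out

def joint_loss(v_x, v_y, v_z, enc, dec_y, dec_z, lam=1.0):
    """Restoration loss plus lam times labeling loss, both predicted from v_x."""
    h = forward(v_x, enc)          # shared representation (Eq. 19)
    v_y_hat = forward(h, dec_y)    # restored image (Eq. 20)
    v_z_hat = forward(h, dec_z)    # predicted label vector (Eq. 21)
    return cross_entropy(v_y, v_y_hat) + lam * cross_entropy(v_z, v_z_hat)

rng = np.random.default_rng(0)
v_x = (rng.random((4, 12)) < 0.5).astype(float)   # noisy inputs
v_y = (rng.random((4, 12)) < 0.5).astype(float)   # clean targets
v_z = np.eye(3)[rng.integers(0, 3, size=4)]       # one-hot labels
enc = [0.1 * rng.standard_normal(s) for s in [(12, 8), (8, 5)]]
dec_y = [0.1 * rng.standard_normal(s) for s in [(5, 8), (8, 12)]]
dec_z = [0.1 * rng.standard_normal(s) for s in [(5, 8), (8, 3)]]
loss = joint_loss(v_x, v_y, v_z, enc, dec_y, dec_z, lam=1.0)
```

Setting `lam=0` reduces the objective to pure restoration, which makes the role of the balancing constant easy to inspect.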

Multi-view multi-class object recognition: Assume that
$v^x$ is the input image containing an object, $v^y$ is
the vector describing the camera view from which the
object was captured, and $v^z$ indicates which class the
object belongs to. The purpose is to answer whether these
additional views, together with the low-level image
features, help to improve the recognition accuracy. Thus,
we propose the following objective:

$$
\begin{aligned}
\hat{\theta} &= \arg\min_\theta \mathcal{L}(v^x, v^y, v^z; \theta) \\
&= \arg\min_\theta \; -\sum_{i=1}^{N} \big[ v_i^z \log \hat{v}_i^z + (1 - v_i^z) \log(1 - \hat{v}_i^z) \big]
\end{aligned}
\tag{22}
$$

where $\theta$ is the set of weights in the 3-way deep
architecture as before, and $\hat{v}_i^z$ is the prediction
from the image $v_i^x$ and its view $v_i^y$, specified as
follows:

$$ h_i^{(x2)} = \underbrace{f_{L-1} \circ \cdots \circ f_1}_{L-1 \text{ times}} (v_i^x) \tag{23} $$

$$ h_i^{(y2)} = \underbrace{g_{L-1} \circ \cdots \circ g_1}_{L-1 \text{ times}} (v_i^y) \tag{24} $$

$$ h_i = \mathrm{sigmoid}\big( h_i^{(x2)\top} W^{(x3)} + h_i^{(y2)\top} W^{(y3)} \big) \tag{25} $$

$$ \hat{v}_i^z = \underbrace{\phi_1 \circ \phi_2 \circ \cdots \circ \phi_L}_{L \text{ times}} (h_i) \tag{26} $$

where we ignore the bias term for $h_i$ for clarity.

Given the parameter initialization with the CD algorithm,
we can minimize the objective function in Eq. 18 or Eq. 22
respectively to estimate the model parameters. To
fine-tune the model, we compute the gradients of the
objective function with respect to the weights in each
layer via backpropagation, and then use any gradient-based
method, such as L-BFGS [19], to update the model
parameters.
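
The fine-tuning step can be sketched for the multi-view objective with a deliberately reduced setup: the branch features are taken as given, the shared layer fuses them as in Eq. 25, and a single sigmoid layer stands in for the label decoder. Plain gradient descent is used here purely as a simple stand-in for L-BFGS; the function names, step size and shapes are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss_and_grads(params, h_x, h_y, z):
    """Cross-entropy of Eq. 22 with a one-layer label decoder, plus backprop gradients."""
    W_x3, W_y3, W_z = params
    h = sigmoid(h_x @ W_x3 + h_y @ W_y3)    # shared layer (Eq. 25)
    z_hat = sigmoid(h @ W_z)                 # label prediction
    p = np.clip(z_hat, 1e-12, 1 - 1e-12)
    loss = -np.sum(z * np.log(p) + (1 - z) * np.log(1 - p))
    d_out = z_hat - z                        # gradient at the output pre-activation
    d_h = (d_out @ W_z.T) * h * (1 - h)      # gradient at the shared pre-activation
    return loss, [h_x.T @ d_h, h_y.T @ d_h, h.T @ d_out]

def fine_tune(params, h_x, h_y, z, lr=0.02, steps=300):
    """Plain gradient descent, a simple stand-in for L-BFGS."""
    for _ in range(steps):
        _, grads = loss_and_grads(params, h_x, h_y, z)
        params = [W - lr * g for W, g in zip(params, grads)]
    return params

rng = np.random.default_rng(0)
h_x = rng.random((8, 5))                      # hypothetical image-branch features
h_y = rng.random((8, 4))                      # hypothetical view-branch features
z = (rng.random((8, 3)) < 0.5).astype(float)  # hypothetical binary targets
params = [0.1 * rng.standard_normal(s) for s in [(5, 6), (4, 6), (6, 3)]]
loss_before, _ = loss_and_grads(params, h_x, h_y, z)
tuned = fine_tune(params, h_x, h_y, z)
loss_after, _ = loss_and_grads(tuned, h_x, h_y, z)
```

In a full implementation the gradients would continue to flow into the lower layers of each branch, and the returned loss and gradients could be handed to any quasi-Newton optimizer instead of the fixed-step loop.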

Note that the joint visual restoration and labeling task
is different from the multi-view multi-class recognition
task. In fact, we can see the differences between the two
objective functions in Eqs. 18 and 22, even though they
use the same multimodal deep structure and initialize the
model parameters with the same CD algorithm. Joint visual
restoration and labeling learns a joint model for multiple
outputs from only one input channel: it has one input and
two outputs, and is thus related to multi-task learning.
Multi-view multi-class object recognition, in contrast,
has two inputs and one output prediction: it leverages
multi-source information to improve performance on a
multi-class classification problem.










3.4 Relationship to other models 

We analyze the differences between our model and other
deep structures, such as the deep autoencoder and deep
Boltzmann machines.


3.4.1 Deep autoencoder

The deep autoencoder [?] can be thought of as a bi-modal
deep model with feed-forward networks. It consists of an
encoder and a decoder in order to recover the data itself
by learning the shared hidden representation. In contrast,
our model is a K-way deep feed-forward neural network with
multiple channels, and it can adapt to different
optimization problems given the same architecture, for
example joint visual restoration and labeling. Although
these two models can use the same CD algorithm to
initialize model parameters, our model is more powerful
because it handles multiple output predictions, instead of
just recovering the data as in the deep autoencoder.

3.4.2 Deep Boltzmann machines

A multimodal DBM can be viewed as a composition of
unimodal undirected pathways. Each pathway can be
pretrained separately in a completely unsupervised
fashion, which makes it possible to leverage a large
supply of unlabeled data. The middle graph in Fig. 1 is a
3-fan multimodal DBM. In our framework, we define a K-fan
deep structure, where each branch can be a deep learning
model, such as a DBN, LSTM or CNN, for different
multimodal inputs. Thus, our model is more powerful.
Moreover, the multimodal DBM is a generative model, while
our model is a discriminative model. In other words, our
model is a multi-path feed-forward neural network with a
shared representation, which can be optimized in a
discriminative manner to handle different tasks.


3.4.3 Multi-task learning 

Multi-task learning is an approach that learns a problem
together with other related problems at the same time,
using a shared representation. Our multi-branch deep
neural network can predict multiple outputs by learning a
shared representation for multiple tasks. Thus, our model
can be used to handle multi-task learning problems, such
as joint visual restoration and labeling. In addition, our
model can leverage multiple inputs or resources to improve
prediction or classification, such as the multi-view
multi-class object recognition in our case. Thus, our deep
K-fan structure is flexible and powerful, and can be
optimized according to different tasks.


4 Experiments 

As mentioned before, we test our deep model on two tasks:
joint visual restoration and labeling, and multi-view
multi-class object recognition. For the former task, we
evaluate the performance with the peak signal-to-noise
ratio (PSNR). In addition, we use the error rate to
evaluate whether denoising is helpful in the recognition
task. For multi-view multi-class object recognition, we
combine the low-level image features with additional
sources, such as camera views, to improve the
classification accuracy.
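
For reference, the two evaluation measures can be computed as in the following sketch. It assumes pixel values in the range [0, peak] with peak = 1 and the usual definition of PSNR from the mean squared error; the paper does not state which peak value it uses, so that choice is an assumption.

```python
import numpy as np

def psnr(clean, restored, peak=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, peak]."""
    clean = np.asarray(clean, dtype=float)
    restored = np.asarray(restored, dtype=float)
    mse = np.mean((clean - restored) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

def error_rate(true_labels, predicted_labels):
    """Percentage of misclassified samples."""
    true_labels = np.asarray(true_labels)
    predicted_labels = np.asarray(predicted_labels)
    return 100.0 * np.mean(true_labels != predicted_labels)
```

A higher PSNR means the restored image is closer to the clean one, while a lower error rate means better labeling, so the two columns of the result tables move in opposite directions for a better method.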

4.1 Data description 

The MNIST dataset¹ consists of 28 × 28 images of
handwritten digits from 0 through 9, with a training set
of 60,000 examples and a testing set of 10,000 examples,
and has been widely used to test character denoising and
recognition methods. A set of examples is shown in
Fig. 2(a). In the experiment, we test both denoising and
recognition performance. As for the noise model, we
consider structural noise that is hard to remove from the
handwritten images. Basically, we randomly add two strokes
to each digit; refer to the structural noise in Fig. 2(b),
which shows heavily corrupted images, with more than 50%
of the regions affected.
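
One possible way to generate such stroke noise is sketched below. The paper does not describe how the strokes are drawn, so everything here is an assumption: strokes are straight segments between random endpoints with a fixed thickness, drawn by setting pixels of a binary image to 1.

```python
import numpy as np

def add_random_strokes(image, rng, n_strokes=2, thickness=2):
    """Return a copy of a binary image with straight strokes set to 1.

    The stroke shape is an assumption; the paper does not specify it.
    """
    noisy = image.copy()
    height, width = noisy.shape
    for _ in range(n_strokes):
        r0, r1 = rng.integers(0, height, size=2)
        c0, c1 = rng.integers(0, width, size=2)
        n_points = 2 * max(height, width)
        rows = np.linspace(r0, r1, n_points).round().astype(int)
        cols = np.linspace(c0, c1, n_points).round().astype(int)
        # Thicken the segment by stamping shifted copies of it.
        for dr in range(thickness):
            for dc in range(thickness):
                rr = np.clip(rows + dr, 0, height - 1)
                cc = np.clip(cols + dc, 0, width - 1)
                noisy[rr, cc] = 1
    return noisy

rng = np.random.default_rng(0)
digit = np.zeros((28, 28))
noisy_digit = add_random_strokes(digit, rng)
```

Each clean image then yields a (clean, noisy) pair, and together with the label this forms one training triplet for the joint restoration and labeling task.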

The USPS Handwritten binary Alphadigits² are binary images
of size 20 × 16 pixels. There are digits “0” through “9”
and capital letters “A” through “Z”, with 39 examples of
each class. In our experiments, we only test our method on
the binary alphabets.

The multi-view multi-class dataset³ consists of 23 minutes
and 57 seconds of synchronized frames taken at 25 fps from
6 different calibrated DV cameras. One camera was placed
about 2 m above the ground, two others were located at
first-floor height, and the rest at second-floor height,
to cover an area of 22 m × 22 m. The ground truth contains
242 annotated multi-view non-consecutive frames. These
frames contain different real situations where
pedestrians, cars and buses appear and can cause heavy
occlusions among them. In our task, we want to test
whether the additional multi-view information is helpful
in the multi-class classification task. Hence, we crop all
objects in the frames according to their ground-truth
bounding boxes, and then resize them to the same size,
88 × 64. Then, we vectorize each cropped object into 5632
dimensions, with an additional 6-dimensional camera view
vector for each object. Finally, we get 4907 instances,
with 1295 persons, 3354 cars and 58 buses.
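
The input construction described above can be sketched as follows. The one-hot encoding of the camera index is an assumption (the text only states that the view information is 6-dimensional), and `make_inputs` is a hypothetical helper that expects a crop already resized to 88 × 64.

```python
import numpy as np

IMAGE_SHAPE = (88, 64)   # resized crop, 88 * 64 = 5632 pixels
N_CAMERAS = 6

def make_inputs(crop, camera_index):
    """Return (v_x, v_y): the flattened crop and a one-hot camera-view vector."""
    crop = np.asarray(crop, dtype=float)
    if crop.shape != IMAGE_SHAPE:
        raise ValueError("expected a crop already resized to 88 x 64")
    v_x = crop.reshape(-1)
    v_y = np.zeros(N_CAMERAS)
    v_y[camera_index] = 1.0   # assumed encoding of the camera view
    return v_x, v_y

v_x, v_y = make_inputs(np.zeros(IMAGE_SHAPE), camera_index=3)
```

Here `v_x` feeds the image branch and `v_y` the view branch, while the class label (person, car or bus) plays the role of the third branch.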


¹ http://yann.lecun.com/exdb/mnist/
² http://www.cs.nyu.edu/~roweis/data/binaryalphadigs.mat
³ http://cvlab.epfl.ch/data/multiclass



4.2 Experimental setting 

In all experiments, we use the 3-fan deep model with a
3-layer nonlinear mapping in each pathway, with learning
rate 0.1 and CD-1 sampling to initialize the model
weights. We set the constant $\lambda = 1$ to weight the
two terms in the objective function in Eq. 18. In the
fine-tuning stage, we use L-BFGS to update the model
parameters. In particular, for joint restoration and
labeling, we use CD (as in multimodal DBMs) to initialize
the weights, and then optimize Eq. 18 to fine-tune the
model parameters with L-BFGS. For the MNIST digits and the
USPS alphabets, we set the number of hidden nodes to
[400 200 250] for each pathway in the 3-layer deep model.

As for the multi-view multi-class object recognition task,
we use the same CD procedure to initialize the weights by
maximizing the joint likelihood, and then optimize Eq. 22
to fine-tune the model parameters. We evaluate the
performance with 10-fold cross validation. As for the
model structure, we set the number of hidden nodes to
[400 200 250] in each pathway in the 3-layer deep model.
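
The 10-fold protocol can be sketched as below. The shuffling seed and the use of `numpy.array_split` to form nearly equal folds are assumptions, since the paper does not describe how the folds were drawn.

```python
import numpy as np

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is test data exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

splits = list(k_fold_indices(4907))
```

For each split, the model would be pretrained and fine-tuned on `train_idx` and scored on `test_idx`, and the reported accuracy is the average over the ten folds.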

4.3 Experimental results 

We evaluated our method on both joint visual restoration 
and labeling and multi-view multi-class object recognition 
tasks. 

4.3.1 Joint restoration and labeling 

In this part, we test our model on the multi-task learning
problem of joint restoration and labeling by minimizing
Eq. 18. Before analyzing the performance, we first
evaluate the PSNR lower bound on the noisy images, as well
as the accuracy upper bound on the clean images.

PSNR and error rate bounds: (1) Generating the noisy
digits: we added noise to each MNIST image by randomly
drawing two strokes to construct its noisy observation
(cluttering more than 50% of the regions). Thus, for all
the clean digits, with 60,000 training and 10,000 testing
images respectively, we can construct the corresponding
noisy images. (2) Training a classifier on the clean data:
we first learned a deep neural network (DNN) [?] for
classification by minimizing the cross entropy. In the
experiment on the MNIST digits, we used the default DNN
parameters, namely a 4-layer deep structure with
[500 500 2000 10] hidden nodes in the respective layers.
Then we trained the model on the 60,000 clean images to
learn the DNN classifier and tested it on the 10,000 clean
testing images. The error rate of the DNN on the clean
testing set is 1.2%, while its error rate on the noisy
testing set, with the heavily corrupting noise shown in
Fig. 2(b), is 61.0%. The PSNR lower bound on the noisy
digits is 7.65 dB, which is calculated on the noisy
testing set.


Model          PSNR (dB)    Error rate (%)
Wiener [20]    11.7         58.5
RoBM [14]      13.9         52.6
DAE            13.58        35.9
Our method     18.6         12.7
DNN [?]        > 7.65       1.20 ~ 61.0

Table 1: The experimental comparison on the heavily
corrupted MNIST digits. It indicates that our method is
better than competitive baselines on both denoising and
recognition tasks.


Evaluation: To test our joint learning model, we trained
it on the 60,000 triplets of clean digit, noisy digit and
label. Then, we tested it on the 10,000 noisy testing
images to restore the clean image and also predict the
label of each. The denoised results (randomly sampled) of
our model are shown in Fig. 2(d), and the quantitative
results are shown in Table 1.

We compared our method to competitive baselines, which
use separate pipelines, restoring the image first and then
recognizing it. Note that all the quantitative results of
the baselines are evaluated on the denoised images using the
DNN classifier. More specifically, for each baseline, we
first use it to denoise the images, and then we use the DNN
classifier to evaluate the recognition accuracy on the
denoised images. The denoising result of the denoising
autoencoder (DAE) [12] is shown in Fig. 2(c), and its
recognition accuracy is shown in Table 1. The comparison
demonstrates that our joint learning method is better than
DAE and RoBM [11].
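
The separate-pipeline evaluation described above can be
sketched as follows. This is only an illustration of the
protocol, not the authors' code: `denoise` and `classify` are
hypothetical callables standing in for a baseline denoiser and
the pretrained DNN classifier.

```python
def error_rate(predictions, labels):
    """Percentage of predictions that differ from the true labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and aligned")
    wrong = sum(1 for p, y in zip(predictions, labels) if p != y)
    return 100.0 * wrong / len(labels)


def evaluate_baseline(denoise, classify, noisy_images, labels):
    """Two-stage pipeline: restore each image, then classify the result."""
    # Stage 1: the baseline denoiser restores every noisy image.
    restored = [denoise(image) for image in noisy_images]
    # Stage 2: the fixed classifier labels the restored images.
    predictions = [classify(image) for image in restored]
    return error_rate(predictions, labels)
```

Because the classifier is held fixed across baselines, the
differences in error rate reflect the quality of the restored
images alone.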

We also evaluated our method on the USPS alphabets.
Similar to the experiment on the MNIST digits, we added
random strokes to the alphabets to create the noisy
observations. Because there are only 39 training images for
each class, we generated 10 corrupted samples for each clean
image. In the experiments, we divided the total 39 × 26
binary images into a training set (80%) and a testing set
(the remaining 20%), and we learned a 2-layer DNN model with
100 and 64 hidden nodes in the respective layers on the clean
training set. The classification error rate on the clean
testing set is 1.29% (Table 2), while the error rate on the
corresponding noisy testing set is 67.4%. Then we used the
learned DNN model to evaluate the denoising performance of
the baselines. The visual result of the denoising
autoencoder is shown in Fig. 3(c), and the result of our
model is shown in Fig. 3(d). The quantitative comparison
between our method and the baselines is shown in Table 2,
which demonstrates that our method for joint visual
restoration and labeling is better than the competitive
baselines on both the denoising and recognition tasks.

Figure 2: Comparison of denoising results on the heavily occluded MNIST digits: (a) original images; (b) noisy images
with random structures; (c) denoising results with the denoising autoencoder; (d) denoising results with our joint restoration
and labeling model.
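
The dataset sizes implied by the USPS setup can be checked
with a short sketch. It assumes a random, non-stratified 80/20
split of the clean images with a fixed seed, and that the 10
corrupted samples are counted per clean image after splitting;
neither detail is stated in the text, and `split_indices` is
an illustrative helper.

```python
import random


def split_indices(n_items, train_fraction=0.8, seed=0):
    """Shuffle item indices and split them into train and test lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = round(n_items * train_fraction)
    return indices[:n_train], indices[n_train:]


n_clean = 39 * 26        # 39 images for each of the 26 letters
samples_per_image = 10   # corrupted samples generated per clean image

train_idx, test_idx = split_indices(n_clean)
n_noisy_train = len(train_idx) * samples_per_image
n_noisy_test = len(test_idx) * samples_per_image
```

Under these assumptions there are 1014 clean images, 811 for
training and 203 for testing, which correspond to 8110 and
2030 corrupted samples respectively.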


Model          PSNR (dB)    Error rate (%)
Wiener [20]    14.2         67.8
RoBM [11]      16.3         62.8
DAE            19.2         42.5
Our method     19.6         32.8
DNN            > 8.12       1.29 ~ 67.4


Table 2: The experimental comparison on the USPS
alphabets. The PSNR value of the DNN row is 8.12 dB, which
is the lower bound on the noisy testing set. The error rate
of the DNN row means that the error rates on the noisy
testing set and the original clean testing set are 67.4% and
1.29% respectively. The results demonstrate that our method
is better than the competitive baselines on both the
denoising and recognition tasks.

4.3.2 Multi-view multi-class recognition 

In this part, we test our model on the multi-view multi-class
recognition task. In this task, our model has two inputs and
one output prediction, thus it can leverage multiple sources
to boost classification accuracy.

Figure 3: Comparison of denoising results on the USPS alphabets (the type 2 noise): (a) original images from 'A' to 'Z'
arranged top-down; (b) noisy images with random structures; (c) denoising results with the denoising autoencoder;
(d) denoising results with our 3-fan deep model.

The multi-view multi-class dataset contains a total of
4907 × 5632 instances belonging to 3 classes, with an
additional 6 bits of camera-view information for each
instance. The dataset is unbalanced, with 1295 persons,
3354 cars and 58 buses. The purpose is to answer whether
the additional camera information is helpful in the
multi-class classification task. In our experiment, we use
10-fold cross validation, randomly sampling 9 folds for
training and using the rest for testing. To train our 3-fan
multimodal deep model, we use both the lower-level image
features and the multi-view information as two inputs, and
predict the object class as the output. We optimize our
model via Eq. (25) and show our result in Table 3. The SVM
with multi-view information yields a better result than the
SVM without it, which indicates that the additional
multi-view information is helpful. Our model outperforms
both the multimodal DBM and the SVM, which demonstrates that
our 3-fan deep model can effectively leverage multiple
sources to boost performance. It also indicates that our
discriminative model is better than the multimodal DBM on
this classification problem.
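
The 10-fold cross-validation protocol mentioned above can be
sketched as follows. This is a generic illustration, not the
authors' code: it assumes that every fold serves as the
testing set exactly once and that the instance count is the
4907 given in the text; `k_fold_splits` is an illustrative
helper.

```python
import random


def k_fold_splits(n_items, k=10, seed=0):
    """Yield (train, test) index lists; each fold is the test set once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # random assignment to folds
    # Striding gives k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_splits(4907, k=10))
```

Each split trains on 9 folds and tests on the remaining one;
averaging the per-fold error rates would then give a single
figure per method.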

To sum up, we can optimize different objective functions
to achieve different goals. For example, if we want to
handle multi-task learning, we can use the objective function
defined earlier for multiple outputs. If we want to leverage
multiple sources to boost classification performance, we can
use an objective function similar to the one in Eq. (22).
However, no matter what objective function we use, we still
use the same K-fan deep multimodal structure. Thus, our model
is flexible enough to leverage multiple inputs for multiple
predictions.


Model                  Error rate (%)
SVM w/o multi-view     9.59
SVM w/ multi-view      6.92
Multimodal DBM         7.10
Our model              4.80


Table 3: The experimental comparison on the multi-view
multi-class recognition dataset. Our model achieves the
lowest error rate among the compared methods.

























5 Conclusions 

In this paper, we propose a generalized K-fan deep
structure with shared representations for multimodal
learning. Our deep multimodal structure is powerful because
each branch can be a deep learning model for a different
input. Given the deep model, we can optimize different
objective functions to learn the shared representation from
multiple modalities and handle different tasks. To learn the
model parameters, we use two stages: parameter initialization
and parameter fine-tuning. In the initialization stage, we
use the CD algorithm to maximize the joint likelihood, as the
multimodal DBM does. In the fine-tuning stage, we update the
model parameters according to the defined objective function.
We test our model on two tasks: joint restoration and
labeling, and multi-view multi-class object recognition. The
former task addresses the multi-task learning problem, while
the latter answers whether our model can leverage multiple
sources to boost classification performance. The
experimental results show that our K-fan deep structure model
is flexible and powerful, and can be optimized according to
different objective functions to address different problems.